Abstract

This year we participate in Task 1 of the TREC Session Track. Our approaches combine the Query Change
Model (QCM), weighted QCM, re-ranking, clustering, and session error analysis. The QCM retrieval model
combines all queries in a session, which allows documents that are relevant to any query in the session to
appear in the final retrieval list. Weighted QCM combines the queries unevenly, based on a prediction of
query quality. It rests on the following intuition: if a query does not retrieve any document that leads to a
SAT-Click from the user, the query is probably poorly formed. Our re-ranking module is based on implicit
feedback from the user, in this case the SAT-Clicked documents. The module boosts a document's ranking
position if it has been SAT-Clicked in the session or in other sessions that share similar search topics. We
apply the K-means clustering algorithm to detect which sessions share similar search topics. Each session
is represented as a term vector in which each unique term is one dimension, weighted by its idf. We also
apply session error analysis in RL3. From the query log, we first identify sessions with similar topics by
clustering, then we use SAT-Clicks from most sessions to re-rank the documents for the sessions that the
algorithm predicts to be poorly issued, i.e. sessions that are more difficult because of ill-formed queries.
Combining the above approaches, we achieve a 20.9% nDCG@10 increment and a 13.0% P@10 increment
from RL1 to RL2, and by utilizing the whole log data, we achieve a 4% nDCG@10 increment and a 0.5%
P@10 increment from RL2 to RL3.

1.  Introduction 

Session search involves multiple search iterations, triggered by query reformulations, to accomplish a
complex search task. In our group's 2013 work [1], we model this interactive process of session search as
a Markov decision process (MDP). In our 2014 work [3][4], we model it as a partially observable MDP
(POMDP). Task 1 of the TREC 2014 Session Track tests whether user interactions with a search engine
within a session can be used to improve search accuracy. The task data includes log data for 1021 sessions.
The log of each session records a sequence of queries $q_1, q_2, \ldots, q_{n-1}, q_n$ issued by the user, where
$q_n$ is the current query in the session. The log also contains the retrieved ranking lists for each past
query, $q_1$ to $q_{n-1}$. Finally, the log records the documents/snippets that the user clicked and the dwell
time that the user spent on each clicked document. There are three subtasks: RL1, RL2 and RL3. RL1
ignores all information in the session log and relies only on the current query to retrieve results. RL2 uses
only information from the current session. RL3 may use any information in the session log.

We apply different techniques in each subtask. In RL1, we feed the last query of a session directly to the
Lemur search engine. The retrieval algorithm is language modeling with Dirichlet smoothing, with the
smoothing parameter $\mu$ set to 5000. In RL2, we adopt the QCM algorithm [1], which combines all
queries in a session to formulate effective structured queries. Each search term is assigned a weight, which
is calculated based on whether the term occurs in previously SAT-Clicked documents and whether the
term was newly added to or removed from the previous query. Furthermore, we decrease a previous
query's weight if the query did not retrieve any document that led to a SAT-Click from the user. Finally,
we boost a document's ranking score if it has been SAT-Clicked in the session. In RL3, we also apply
QCM and decrease a query's weight if it retrieved no SAT-Clicked documents. In addition, we boost a
document's ranking score if it is SAT-Clicked in sessions that belong to the same or a similar topic. We
identify similar topics by clustering sessions based on query similarity using the K-means clustering
algorithm. Another tactic we apply in RL3 is to replace a bad session's retrieval results with the results of
a good session whose search topic is similar. We evaluate a session's performance based on the number of
user clicks.

We organize this paper as follows. We discuss each technical approach in detail in Sections 2 to 7. In
Section 8, we present our submissions and the evaluation results. In Section 9, we conclude our work.

2.  Ad-hoc  Retrieval  Model  (Ad-hoc) 

Our RL1 approach directly uses the current query of each session as search terms. The retrieval algorithm
is language modeling with Dirichlet smoothing [2]. The relevance score of a document $d$ with respect to
a search term $t$ is calculated by the formula:

$$\frac{tf(t,d) + \mu P(t|C)}{length(d) + \mu}$$

where $tf(t,d)$ is the frequency of term $t$ in document $d$, $length(d)$ is the length of document $d$, and
$P(t|C)$ is the probability that term $t$ appears in corpus $C$. $\mu$ is the Dirichlet smoothing parameter
and is set to 5000 in our experiment.
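
As a minimal sketch of the formula above, the per-term score can be computed as follows. The function and argument names are illustrative assumptions, and summing log scores over the query terms is a common way to combine terms that is assumed here rather than taken from the run configuration.

```python
import math

def dirichlet_term_score(tf, doc_length, p_t_corpus, mu=5000.0):
    """Dirichlet-smoothed score of one term for one document."""
    return (tf + mu * p_t_corpus) / (doc_length + mu)

def query_log_likelihood(query_terms, doc_tf, doc_length, corpus_prob, mu=5000.0):
    """Sum of log term scores over the query (assumed combination rule).

    Every query term needs a non-zero entry in corpus_prob, otherwise the
    logarithm is undefined for terms that are absent from the document.
    """
    return sum(
        math.log(dirichlet_term_score(doc_tf.get(t, 0), doc_length, corpus_prob[t], mu))
        for t in query_terms
    )
```

With mu = 5000, a term that occurs 5 times in a 1000-word document and has corpus probability 0.001 scores (5 + 5) / 6000.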

3.  The  Query  Change  Retrieval  Model  (QCM) 

Within a session, users modify their queries to better express their information needs. In the Query Change
Retrieval Model (QCM) [1], query changes are treated as relevance feedback and used to adjust query term
weights. First, it defines $\Delta q_i = q_i - q_{i-1}$ as the query change between two adjacent queries,
$q_{i-1}$ and $q_i$. Then $\Delta q_i$ is divided into three parts: the added terms ($+\Delta q_i$), the removed
terms ($-\Delta q_i$) and the theme terms ($q_{theme}$).


Table 1 A Query Change Example (TREC 2014 Session 52)

| Session | Queries | Query Change | $q_{theme}$ |
|---|---|---|---|
| Session 52 | $q_1$ = hydropower efficiency | | hydropower |
| | $q_2$ = hydropower environment | $+\Delta q_2$ = environment, $-\Delta q_2$ = efficiency | |
| | $q_3$ = hydropower damage | $+\Delta q_3$ = damage, $-\Delta q_3$ = environment | |
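
The decomposition in Table 1 can be sketched in a few lines. This is an illustration only: queries are split on whitespace, and the theme terms are approximated as the terms shared by two adjacent queries, which is a simplifying assumption rather than the definition used in [1].

```python
def query_change(prev_query, curr_query):
    """Split the change between two adjacent queries into added, removed and theme terms."""
    prev_terms = prev_query.split()
    curr_terms = curr_query.split()
    added = [t for t in curr_terms if t not in prev_terms]
    removed = [t for t in prev_terms if t not in curr_terms]
    theme = [t for t in curr_terms if t in prev_terms]  # assumption: shared terms
    return added, removed, theme

added, removed, theme = query_change("hydropower efficiency", "hydropower environment")
```

For the first two queries of Session 52 this yields "environment" as the added term, "efficiency" as the removed term and "hydropower" as the theme term.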

Table 1 presents an example of query changes in the TREC 2014 Session Track. For query $q_i$, query term
weights are adjusted based on four types of strategies, $W_{Theme}$, $W_{Add,In}$, $W_{Add,Out}$ and
$W_{Remove}$ [1]. The relevance score between query $q_i$ and a document $d$ becomes:

$$Score(q_i, d) = \log P(q_i|d) + \alpha W_{Theme} - \beta W_{Add,In} + \epsilon W_{Add,Out} - \delta W_{Remove}$$

Parameters $\alpha$, $\beta$, $\epsilon$ and $\delta$ are the linear weighting coefficients for each type of
strategy. They are set to $\alpha = 2.2$, $\beta = 1.8$, $\epsilon = 0.07$ and $\delta = 0.4$ in our submission.
The QCM model combines all queries in a session using the formula:

$$Score_{qcm}(q_{1..n}, d) = \sum_{i=1}^{n} \gamma^{n-i} Score(q_i, d)$$

where $\gamma$ is the discount factor for the prior queries in the session. In the TREC Session Track setting,
evaluation is based on the whole session, so the prior queries are as important as the current query. Hence
we set $\gamma = 1$ in our experiment.
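
To make the effect of the discount factor concrete, the session-level combination can be sketched as below. The per-query scores are assumed to be already computed for one document, and the function name is illustrative.

```python
def qcm_session_score(per_query_scores, gamma=1.0):
    """Combine per-query scores for one document; the last entry is the current query."""
    n = len(per_query_scores)
    return sum(
        gamma ** (n - i) * score
        for i, score in enumerate(per_query_scores, start=1)
    )
```

With gamma = 1 every query contributes equally, which is the setting used in our runs; a value below 1 would down-weight earlier queries.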

4. Weighted QCM

QCM allows documents that are relevant to any query in a session to appear in the final retrieval list.
When the parameter $\gamma$ is set to 1, all queries in a session are combined evenly. However, we argue
that queries should not be combined evenly. Here we define two concepts, the Strong SAT-Clicked
document and the Weak SAT-Clicked document. A Strong SAT-Clicked document is a retrieved document
that the user clicked and dwelled on for more than 30 seconds. A Weak SAT-Clicked document is also a
clicked document, but with a dwell time of more than 10 seconds and less than 30 seconds.
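
The two click types can be expressed as a small classifier over dwell time. This sketch follows the definitions literally, so a dwell time of exactly 30 seconds falls into neither class; the label strings are illustrative.

```python
def classify_click(dwell_seconds):
    """Return 'strong', 'weak' or None for a clicked document's dwell time in seconds."""
    if dwell_seconds > 30:
        return "strong"
    if 10 < dwell_seconds < 30:
        return "weak"
    return None  # too short, or exactly 30 s, which the definitions leave open
```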

We assume that the dwell time on a clicked document indicates how relevant that document is. If a query
does not retrieve any document that leads to a SAT-Click from the user, this indicates that the query is
poorly formed, so its weight should be decreased. Weighted QCM combines queries based on query
quality: the weight of a poorly formed query is decreased by a factor $\omega \in (0, 1)$. Its score function is:

$$Score_{wqcm}(q_{1..n}, d) = \sum_{q_i \in Q_{good}} Score_{qcm}(q_i, d) + \omega \sum_{q_j \in Q_{bad}} Score_{qcm}(q_j, d)$$


$Q_{good}$ is the set of queries that each bring at least one SAT-Click from the user, while $Q_{bad}$ is the
set of queries that bring no SAT-Click. The current query is an exception: it has no SAT-Click because it
has no retrieval results yet, but it is assigned to $Q_{good}$.
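
A minimal sketch of this weighting, assuming the per-query scores for one document are already available and that a Boolean flag records whether each query led to a SAT-Click (both inputs and the function name are illustrative):

```python
def weighted_qcm_score(query_scores, has_sat_click, omega=0.8):
    """Sum per-query scores, scaling queries without SAT-Clicks by omega.

    The last query is the current query and always counts as good.
    """
    last = len(query_scores) - 1
    total = 0.0
    for i, score in enumerate(query_scores):
        is_good = has_sat_click[i] or i == last
        total += score if is_good else omega * score
    return total
```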

5. User-Click Model

Since a SAT-Click indicates a document's relevance, we boost a document's ranking score if it has been
SAT-Clicked by users.

5.1  Session  Level  User-Click  Model 

In this approach, we only use information in the current session. We boost a document's ranking score if
it has been SAT-Clicked in the current session. The score function is:

$$Score_{session\_click}(q_{1..n}, d) = Score_{qcm}(q_{1..n}, d) + Score_{session\_boost}(q_{1..n}, d)$$

$$Score_{session\_boost}(q_{1..n}, d) = \frac{\psi |StrongSATClicks_d| + \theta |WeakSATClicks_d|}{\sum_{d_i \in session} (\psi |StrongSATClicks_{d_i}| + \theta |WeakSATClicks_{d_i}|)}$$

where $|StrongSATClicks_d|$ is the number of times that document $d$ is strongly SAT-Clicked in the current
session and $|WeakSATClicks_d|$ is the number of times that $d$ is weakly SAT-Clicked. The boosting score
is normalized by the total weighted number of SAT-Clicks in the session. We experimentally set $\psi = 2$
and $\theta = 1$.
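
The boosting term can be sketched as follows, assuming click counts are held in dictionaries keyed by document id. Returning 0 for a session without SAT-Clicks is an assumption made here to avoid division by zero.

```python
def session_boost(strong_clicks, weak_clicks, doc_id, psi=2.0, theta=1.0):
    """Normalized SAT-Click boost of one document within a session.

    strong_clicks / weak_clicks map document ids to click counts.
    """
    docs = set(strong_clicks) | set(weak_clicks)
    total = sum(
        psi * strong_clicks.get(d, 0) + theta * weak_clicks.get(d, 0) for d in docs
    )
    if total == 0:
        return 0.0  # assumption: no SAT-Clicks means no boost
    own = psi * strong_clicks.get(doc_id, 0) + theta * weak_clicks.get(doc_id, 0)
    return own / total
```

For example, with one strong click on d1 and two weak clicks on d2, both documents receive a boost of 0.5 and every other document receives 0.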

5.2  Topic  Level  User-Click  Model 

This approach is similar to the Session Level User-Click Model. The difference is that instead of using
only the information in the current session, we utilize information in all sessions that share similar search
topics. We cluster sessions based on their search topics; the clustering algorithm is described in detail in
Section 6. We boost a document's ranking score if it has been SAT-Clicked in sessions that share similar
search topics with the current session. The score function is:

$$Score_{cluster\_click}(q_{1..n}, d) = Score_{qcm}(q_{1..n}, d) + Score_{cluster\_boost}(q_{1..n}, d)$$

$$Score_{cluster\_boost}(q_{1..n}, d) = \frac{\psi |StrongSATClicks_d| + \theta |WeakSATClicks_d|}{\sum_{d_i \in Cluster} (\psi |StrongSATClicks_{d_i}| + \theta |WeakSATClicks_{d_i}|)}$$

where $Cluster$ is the set of sessions that share similar search topics with the current session,
$|StrongSATClicks_d|$ is the number of times that document $d$ is strongly SAT-Clicked in the $Cluster$,
and $|WeakSATClicks_d|$ is the number of times that $d$ is weakly SAT-Clicked. The boosting score is
normalized by the total weighted number of SAT-Clicks in the $Cluster$. We again set $\psi = 2$ and
$\theta = 1$.


6. Clustering


We cluster sessions by search topic, comparing the similarity of the queries in different sessions.

• First, we combine all queries in one session and convert them into a term vector. Each unique search
term is one dimension of the vector.

• Then, we assign each term's idf value as the weight of its dimension.

• Finally, we cluster sessions based on the Euclidean distance between their query vectors.

We use the K-means clustering algorithm and set K to 60, which is the number of distinct topic ids in the
log file. This number may not be available in a real search environment; in that situation we could tune K
or choose a relatively large value. Clustering algorithms that do not require the number of clusters to be
fixed in advance are possible alternatives, but we did not explore them in our experiments.
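
The three steps above can be sketched with a small, dependency-free K-means. This is an illustration under stated assumptions: the idf values are placeholders rather than corpus statistics, the centroids are seeded with the first K vectors to keep the example deterministic, and the toy data uses K = 2 instead of 60.

```python
import math

def session_vector(queries, vocab, idf):
    """Build an idf-weighted term vector from all queries of one session."""
    terms = set(" ".join(queries).lower().split())
    return [idf.get(t, 0.0) if t in terms else 0.0 for t in vocab]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iterations=20):
    """Plain K-means; the first k vectors seed the centroids (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: euclidean(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy data: four sessions on two topics.
sessions = [
    ["hydropower efficiency", "hydropower damage"],
    ["infant milestones", "culture milestones"],
    ["hydropower environment"],
    ["infant development milestones"],
]
vocab = sorted({t for qs in sessions for q in qs for t in q.split()})
idf = {t: 1.0 for t in vocab}  # placeholder idf values
vectors = [session_vector(qs, vocab, idf) for qs in sessions]
labels = kmeans(vectors, k=2)
```

On this toy data the two hydropower sessions end up in one cluster and the two infant-milestone sessions in the other.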

7. Session Performance Prediction and Replacement

We observe a specific pattern among sessions that share similar search topics: most of them contain
SAT-Clicks, but a few do not. This suggests that, for those few sessions, the poor retrieval results may be
caused by ill-formed queries rather than by difficult search tasks. For these sessions, we replace their
retrieval results with the results of a good session whose search topic is similar.


Table 2 Features Extracted from Session Data Log

| Feature | Definition |
|---|---|
| $F_1$ | The user's intent in session $s$ is to compare two or more items. |
| $F_2$ | The user did not click any retrieved document in session $s$. |
| $F_3$ | $t_{dwell}$ [illegible] s |
| $F_4$ | # of unique terms in session $s$ > 20 |
| $F_5$ | $t_{dwell\_per\_click}$ [illegible] $t^{(3)}_{dwell\_per\_click}$ |
| $F_6$ | Session $s$ does not contain the most frequent term in $T(s)$. |
| $F_7$ | # of unique terms in session $s$ < 6 |
| $F_8$ | # of SAT clicks in session $s$ < $\frac{\sum_{s' \in T(s)} \#\text{ of SAT clicks in session } s'}{\lvert T(s) \rvert}$ |

To distinguish good sessions from bad sessions automatically, we extract eight features from the session
click log. For convenience, we first introduce some notation. For each session $s$, let $t_{dwell}$ denote the
user's total dwell time over the whole session, and define the average dwell time per click as

$$t_{dwell\_per\_click} = \frac{t_{dwell}}{\#\text{ of clicks in session } s}$$

Then all the average dwell times are sorted in descending order,

$$t^{(1)}_{dwell\_per\_click},\ t^{(2)}_{dwell\_per\_click},\ t^{(3)}_{dwell\_per\_click},\ \ldots$$

Moreover, we use $T(s)$ to denote the topic cluster that contains $s$. Using this notation, all the features
are listed in Table 2.

Here, $F_1$ is designed to deal with a shortcoming of QCM. In our experience, when QCM is applied to
session search, the nDCG scores are often low when the user tries to compare several items in one
session. For example, the user may want to compare infant developmental milestones across cultures by
posing a query like “culture difference in milestones in 0-12 month olds”. This example is from the 111th
session in Session Track 2014. We treat a session as being of this kind when its queries include terms
matching patterns like “compare”, “differ”, “versus”, “vs” and “v.s.”.
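
A pattern test of this kind can be sketched with a regular expression. The exact matching rules are assumptions: "compare", "differ" and "versus" are matched as substrings so that "difference" is caught, while "vs" is required to be a separate word to avoid accidental matches inside longer words.

```python
import re

# Patterns from the text; word boundaries around "vs" are an assumption.
COMPARISON_PATTERN = re.compile(r"compare|differ|versus|\bvs\b|\bv\.s\.", re.IGNORECASE)

def is_comparison_session(queries):
    """Return True if any query in the session looks like a comparison request."""
    return any(COMPARISON_PATTERN.search(q) for q in queries)
```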


All eight features are Boolean, i.e. each is either TRUE or FALSE. For each feature $F_i$ ($i = 1, 2, \ldots, 8$),
we count the number of sessions satisfying $F_i$ = TRUE. For each session $s$, an estimation score
$score(s)$ is calculated as follows:

$$score(s) = \sum_{i=1}^{8} (\#\text{ of sessions satisfying } F_i = \text{TRUE}) \cdot I(F_i)$$

where $I(F_i)$ is an indicator function that returns 1 if session $s$ satisfies feature $F_i$ and 0 otherwise.
All sessions are ranked according to their estimation scores. The top 1/3 of the sessions are regarded as
bad sessions and the others are regarded as good sessions.
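
The final split can be sketched as below, given precomputed estimation scores. Two details are assumptions: sessions are ranked from highest to lowest score, and the size of the top third is rounded down.

```python
def split_sessions(scores):
    """Split session ids into (bad, good) by estimation score.

    scores maps a session id to its estimation score.
    """
    ranked = sorted(scores, key=lambda s: scores[s], reverse=True)
    cutoff = len(ranked) // 3  # assumption: round the top third down
    return ranked[:cutoff], ranked[cutoff:]

bad, good = split_sessions({"s1": 5, "s2": 1, "s3": 9, "s4": 3, "s5": 7, "s6": 2})
```

In this hypothetical example the two highest-scoring sessions, s3 and s5, are marked as bad.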

8. Experiments

8.1. Data preparation

We run our experiments on the ClueWeb12 CatA dataset. It consists of 733,019,372 English web pages,
collected between February 10, 2012 and May 10, 2012. Spam documents are filtered out based on their
Waterloo Spam scores.

8.2.  Submission 


Table 3 TREC 2014 Session Track Submissions

| | GUS14RUN1 | GUS14RUN2 | GUS14RUN3 |
|---|---|---|---|
| RL1 | Ad-hoc Retrieval Model | Ad-hoc Retrieval Model | Ad-hoc Retrieval Model |
| RL2 | Weighted QCM ($\omega$=0.65); Session Level User-Click Model | Weighted QCM ($\omega$=0.8); Session Level User-Click Model | Weighted QCM ($\omega$=0.8); Session Level User-Click Model |
| RL3 | Weighted QCM ($\omega$=0.65); Topic Level User-Click Model | Weighted QCM ($\omega$=0.8); Topic Level User-Click Model | Weighted QCM ($\omega$=0.8); Topic Level User-Click Model using topic ids; Session Performance Prediction and Replacement |

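Table 4 nDCG@10 and P@10 for top 100 sessions

| | GUS14RUN1 nDCG@10 | GUS14RUN1 P@10 | GUS14RUN2 nDCG@10 | GUS14RUN2 P@10 | GUS14RUN3 nDCG@10 | GUS14RUN3 P@10 | Max nDCG@10 | Max P@10 | Med nDCG@10 | Med P@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| RL1 | 0.2053 | 0.378 | 0.2053 | 0.378 | 0.2053 | 0.378 | 0.3890 | 0.629 | 0.1549 | 0.348 |
| RL2 | 0.2458 | 0.426 | 0.2482 | 0.427 | 0.2482 | 0.427 | 0.4865 | 0.712 | 0.1626 | 0.372 |
| RL3 | 0.2443 | 0.423 | 0.2458 | 0.424 | 0.258 | 0.429 | 0.5111 | 0.744 | 0.1790 | 0.404 |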

Table  3  lists  our  submissions  in  TREC  2014  Session  Track  and  their  configurations.  We  submit  three  runs 
in  total:  GUS14RUN1,  GUS14RUN2  and  GUS14RUN3.  Each  run  contains  three  ranking  lists,  one  for 
task  RL1,  one  for  task  RL2  and  one  for  task  RL3. 

It is worth pointing out that in GUS14RUN3 task RL3, we apply the Topic Level User-Click Model
differently. Here we do not use clustering to determine which sessions share similar search topics; instead,
we directly use the topic ids in the log file to determine topic similarity between sessions. By doing this
we can evaluate the effectiveness of the clustering method in session search. Furthermore, when we apply
Session Performance Prediction and Replacement, we also use topic ids to determine session clusters. We
do not use clustering there because clustering is based on query similarity: if the queries are similar, the
retrieval performance should be similar too. Hence it is difficult to find a good session to replace a bad
session when sessions are clustered by query similarity.

8.3.  Results 

Table 4 shows the evaluation results of our submissions. By utilizing current session information, we
achieve a 20.9% nDCG@10 increment and a 13.0% P@10 increment from RL1 to RL2, and by utilizing the
whole log data, we achieve a 4% nDCG@10 increment and a 0.5% P@10 increment from RL2 to RL3. All
submissions achieve a substantial performance improvement from RL1 to RL2; however, only
GUS14RUN3 achieves an improvement from RL2 to RL3, and it is small. This may be caused by the
characteristics of the Session Track tasks. The search tasks are relatively complex, and there are rich
interactions within a session that help the search engine infer user intent. However, there are few similar
sessions that can be used to recommend good documents for the current session. The performances of
GUS14RUN2 RL3 and GUS14RUN3 RL3 are close, which suggests that clustering sessions by query
similarity is about as good as directly using topic ids. GUS14RUN3's RL3 gets the highest P@10 score in
20 of the first 100 sessions, which suggests that our approaches are effective.
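
The reported increments correspond to the GUS14RUN3 column of Table 4 and can be reproduced with a short calculation. The helper name is illustrative; the input values are copied from the table.

```python
def relative_gain(before, after):
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before

ndcg = {"RL1": 0.2053, "RL2": 0.2482, "RL3": 0.258}   # GUS14RUN3 nDCG@10
p10 = {"RL1": 0.378, "RL2": 0.427, "RL3": 0.429}      # GUS14RUN3 P@10

ndcg_rl1_to_rl2 = round(relative_gain(ndcg["RL1"], ndcg["RL2"]), 1)  # 20.9
p10_rl1_to_rl2 = round(relative_gain(p10["RL1"], p10["RL2"]), 1)     # 13.0
ndcg_rl2_to_rl3 = round(relative_gain(ndcg["RL2"], ndcg["RL3"]))     # 4
p10_rl2_to_rl3 = round(relative_gain(p10["RL2"], p10["RL3"]), 1)     # 0.5
```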

9.  Conclusion 

We apply a combination of several techniques to the TREC 2014 Session Track. We achieve a substantial
performance boost from RL1 to RL2, and a small improvement from RL2 to RL3. The evaluation results
suggest that 1) considering previous queries together with the current query is suitable for the session
search task; 2) user SAT-Clicks are useful for estimating query quality and document relevance; 3)
clustering sessions by query similarity is effective; 4) in session search, a session itself contains rich
interaction information, which can be used to improve search accuracy.